\chapter{The Perceptron}
\label{chap-perceptron}
First of all, the coolest algorithm name! \note{Well, maybe
``neocognitron,'' also the name of a real ML algorithm, is cooler.}
It is based on the 1943 model of neurons by McCulloch and Pitts and on
Hebb's model of learning. It was developed by Rosenblatt in 1962. At
the time, it was not interpreted as attempting to optimize any
particular criterion; it was presented directly as an algorithm.
Since then, there has been a huge amount of study and analysis of its
convergence properties and other aspects of its behavior.
\section{Algorithm}
Recall that we have a training dataset $\dataTrain$ with $x \in \R^d$
and $y \in \{-1, +1\}$. The Perceptron algorithm trains a binary
classifier $h(x; \theta, \theta_0)$ using the following algorithm to
find $\theta$ and $\theta_0$ using (at most) $\tau$ \anchorednote{iterative
steps:}{We use Greek letter $\tau$ here instead of $T$ so we don't
confuse it with transpose!}
\begin{codebox}
\Procname{$\proc{Perceptron}(\tau, \dataTrain)$}
\li $\theta \gets
\begin{bmatrix}
0 & 0 & \cdots & 0
\end{bmatrix}^T$
\li $\theta_0 \gets 0$
\li \For $t \gets 1$ \To $\tau$
\li \Do
changed = False
\li \For $i \gets 1$ \To $n$
\li \Do
\If $\ex{y}{i}\left(\theta^T\ex{x}{i} + \theta_0\right) \le 0$
\li \Then
$\theta \gets \theta + \ex{y}{i}\ex{x}{i}$
\li $\theta_0 \gets \theta_0 + \ex{y}{i}$
\li changed = True
\End
\End
\li \If {\sc not} changed
\li \Then
break
\End
\End
\li \Return $\theta, \theta_0$
\end{codebox}
\note{Let's check dimensions. Remember that $\theta$ is $d \times 1$,
$\ex{x}{i}$ is $d \times 1$, and $\ex{y}{i}$ is a scalar. Does
everything match?}
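The pseudocode above translates almost line for line into NumPy; here
is a minimal runnable sketch (the function and variable names are ours,
not part of the notes):
\begin{verbatim}
import numpy as np

def perceptron(X, y, tau):
    # X: (n, d) array of examples; y: (n,) array of +1/-1 labels
    n, d = X.shape
    th = np.zeros(d)            # theta, initialized to the zero vector
    th0 = 0.0                   # theta_0
    for _ in range(tau):
        changed = False
        for i in range(n):
            if y[i] * (th @ X[i] + th0) <= 0:   # mistake on example i
                th = th + y[i] * X[i]
                th0 = th0 + y[i]
                changed = True
        if not changed:         # a full pass with no mistakes: done
            break
    return th, th0
\end{verbatim}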
Intuitively, on each step, if the current hypothesis
$\theta, \theta_0$ classifies example $\ex{x}{i}$ correctly, then no
change is made. If it classifies $\ex{x}{i}$ incorrectly, then it
moves $\theta, \theta_0$ so that it is ``closer'' to classifying
$\ex{x}{i}, \ex{y}{i}$ correctly.
Note that if the algorithm ever makes a full pass through every value
of $i$ without making an update, it will never make any further
updates (this would be true even if the ``break'' statement were
removed; verify that you believe this!), and so it should just
terminate at that point.
\question {\em What is true about $\trainerr$ if that happens?}
\begin{examplebox}{\bf Example:}
Let $h$ be the linear classifier defined by $\ex{\theta}{0} =
\protect\twodcol{1}{-1}, \ex{\theta_0}{0} = 1$.
\noindent
The diagram below shows several points classified by $h$. However, in
this case, $h$ (represented by the bold line and normal vector $\ex{\theta}{0}$) misclassifies the point
$\ex{x}{1} = \twodcol{1}{3}$ which has label $\ex{y}{1} = 1$. Indeed,
$$ \ex{y}{1}\left({\ex{\theta}{0}}^T \ex{x}{1} + \ex{\theta_0}{0}\right) = 1 \cdot \left(\twodrow{1}{-1}
\twodcol{1}{3} + 1\right) = -1 < 0 \;\;.$$
By running an iteration of the Perceptron algorithm, we update
$$ \ex{\theta}{1} = \ex{\theta}{0} + \ex{y}{1}\ex{x}{1} = \twodcol{2}{2} $$
$$ \ex{\theta_0}{1} = \ex{\theta_0}{0} + \ex{y}{1} = 2 $$
The new classifier (represented by the dashed line and normal vector $\ex{\theta}{1}$) now correctly
classifies that point, but now makes a mistake on the negatively
labeled point.
\begin{center}
\begin{tikzpicture}
\draw [thin, gray!40] (-3,-3) grid (4,4);
\draw [<->] (-3,0) -- (4,0);
\draw [<->] (0,-3) -- (0,4);
\draw [ultra thick] (-3, -2) -- (3, 4)
node [above right] {\large${\ex{\theta}{0}}^Tx + \ex{\theta_0}{0} = 0$};
\draw [thick,teal,-latex] (0.5, 1.5) -- (1.4, 0.6)
node [black,above right] {$\ex{\theta}{0}$};
\draw [ultra thick, dashed] (-3, 2) -- (2, -3)
node [black,above right] {\large${\ex{\theta}{1}}^Tx + \ex{\theta_0}{1} = 0$};
\draw [thick,teal,-latex] (0.0, -1.0) -- (0.9, -0.1)
node [black,below right] {$\ex{\theta}{1}$};
\node [above right,xshift=.3em] at (1,3) {\large$\ex{x}{1}$};
\foreach \Point in {(1, 3), (2.5, 1.5)}{
\pic at \Point {plus};
}
\foreach \Point in {(-1.5, 1.5)}{
\pic at \Point {minus};
}
\end{tikzpicture}
\end{center}
\end{examplebox}
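We can verify this update numerically; a quick sketch, in the same
conventions as the code sketch above:
\begin{verbatim}
import numpy as np

th, th0 = np.array([1., -1.]), 1.0   # theta^(0), theta_0^(0)
x1, y1 = np.array([1., 3.]), 1
print(y1 * (th @ x1 + th0))          # -1.0 <= 0: a mistake, so update
th, th0 = th + y1 * x1, th0 + y1     # theta^(1) = [2, 2], theta_0^(1) = 2
print(y1 * (th @ x1 + th0))          # 10.0 > 0: x^(1) is now correct
\end{verbatim}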
A really important fact about the perceptron algorithm is that, if
there is a linear classifier with 0 training error, then for large
enough $\tau$ this algorithm will (eventually) find one! We'll look
at a proof of this in detail, next.
\section{Offset}
\label{sec-offset}
Sometimes, it can be easier to implement or analyze classifiers of the
form
\begin{eqnarray*}
h(x; \theta) =
\begin{cases}
+1 & \text{if } \theta^Tx > 0 \\
-1 & \text{otherwise.}
\end{cases}
\end{eqnarray*}
Without an explicit offset term ($\theta_0$), this separator must pass
through the origin, which may appear to be limiting. However, we can
convert any problem involving a linear separator \emph{with} offset
into one with \emph{no} offset (but of higher dimension)!
Consider the $d$-dimensional linear separator defined by
$\theta = \begin{bmatrix} \theta_1 & \theta_2 & \cdots
& \theta_d\end{bmatrix}^T$ and offset $\theta_0$.
\begin{itemize}
\item for each data point
$(x, y) \in \dataTrain$, append a coordinate with value +1 to $x$, yielding
$$x_{\rm new} = \begin{bmatrix} x_1 & \cdots & x_d & +1 \end{bmatrix}^T$$
\item define $$\theta_{\rm new} = \begin{bmatrix} \theta_1 & \cdots &
\theta_d & \theta_0\end{bmatrix}^T$$
\end{itemize}
Then,
\begin{align*}
\theta_{\rm new}^T x_{\rm new} & = \theta_1x_1 + \cdots + \theta_dx_d
+ \theta_0 \cdot 1 \\
& = \theta^Tx + \theta_0
\end{align*}
Thus, $\theta_{\rm new}$ is an equivalent ($(d+1)$-dimensional) separator to
our original, but with no offset.
Consider the dataset:
\begin{align*}
X & = [[1], [2], [3], [4]] \\
Y & = [[+1], [+1], [-1], [-1]]
\end{align*}
It is linearly separable in $d = 1$ with $\theta = [-1]$ and $\theta_0
= 2.5$. But it is not linearly separable through the origin! Now,
let
\[X_{\rm new} = \begin{bmatrix}
\begin{bmatrix} 1 \\ 1 \end{bmatrix} &
\begin{bmatrix} 2 \\ 1 \end{bmatrix} &
\begin{bmatrix} 3 \\ 1 \end{bmatrix} &
\begin{bmatrix} 4 \\ 1 \end{bmatrix}
\end{bmatrix}\]
This new dataset is separable through the origin, with $\theta_{\rm
new} = [-1, 2.5]^T$.
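In code, the transformation is just appending a column of ones; a
sketch for this dataset:
\begin{verbatim}
import numpy as np

X = np.array([[1.], [2.], [3.], [4.]])
y = np.array([1, 1, -1, -1])
X_new = np.hstack([X, np.ones((len(X), 1))])  # append the +1 coordinate
th_new = np.array([-1., 2.5])
print(y * (X_new @ th_new))  # [1.5 0.5 0.5 1.5]: all positive, separated
\end{verbatim}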
We can make a simplified version of the perceptron algorithm if we
restrict ourselves to separators through the origin: \note{We list it
here because this is the version of the algorithm we'll study in
more detail.}
\begin{codebox}
\Procname{$\proc{Perceptron-Through-Origin}(\tau, \dataTrain)$}
\li $\theta \gets
\begin{bmatrix}
0 & 0 & \cdots & 0
\end{bmatrix}^T$
\li \For $t \gets 1$ \To $\tau$
\li \Do
changed = False
\li \For $i \gets 1$ \To $n$
\li \Do
\If $\ex{y}{i}\left(\theta^T\ex{x}{i}\right) \le 0$
\li \Then
$\theta \gets \theta + \ex{y}{i}\ex{x}{i}$
\li changed = True
\End
\End
\li \If {\sc not} changed
\li \Then
break
\End
\End
\li \Return $\theta$
\end{codebox}
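In NumPy, the through-origin variant is the earlier sketch with
$\theta_0$ dropped; running it on the augmented dataset $X_{\rm new}$
from above recovers a separator through the origin (again, the names
here are ours):
\begin{verbatim}
import numpy as np

def perceptron_through_origin(X, y, tau):
    th = np.zeros(X.shape[1])
    for _ in range(tau):
        changed = False
        for i in range(X.shape[0]):
            if y[i] * (th @ X[i]) <= 0:   # margin not positive: update
                th = th + y[i] * X[i]
                changed = True
        if not changed:
            break
    return th

X_new = np.array([[1., 1.], [2., 1.], [3., 1.], [4., 1.]])
y = np.array([1, 1, -1, -1])
th = perceptron_through_origin(X_new, y, 1000)
print(np.all(y * (X_new @ th) > 0))   # True: all points separated
\end{verbatim}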
\section{Theory of the perceptron}
Now, we'll say something formal about how well the perceptron
algorithm really works. We start by characterizing the set of
problems that can be solved perfectly by the perceptron algorithm, and
then prove that, in fact, it can solve these problems. In addition,
we provide a notion of what makes a problem difficult for the
perceptron and link that notion of difficulty to the number of updates
the algorithm will make.
\subsection{Linear separability}
A training set $\dataTrain$ is {\em{linearly separable}} if there exist
$\theta, \theta_0$ such that, for all $i = 1, 2, \dots, n$:
$$ \ex{y}{i}\left(\theta^T\ex{x}{i} + \theta_0\right) > 0 \;\;.$$
Another way to say that $\dataTrain$ is linearly separable
is that there exists an $h$ such that all predictions on the training set
are correct,
$$ h(\ex{x}{i}; \theta, \theta_0) = \text{sign}\left(\theta^T\ex{x}{i} + \theta_0\right) =
\ex{y}{i} \;\;,$$
in which case we call $h$ a {\em linear separator}.
And, another way to say that $\dataTrain$ is linearly separable
is that there exists an $h$ of the form above such that
the training error is zero,
$$\trainerr(h) = 0 \;\;.$$
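Given a candidate $\theta, \theta_0$, the separability condition can
be checked with one vectorized test; a minimal sketch:
\begin{verbatim}
import numpy as np

def separates(th, th0, X, y):
    # True iff y^(i) (th . x^(i) + th0) > 0 for every training point
    return bool(np.all(y * (X @ th + th0) > 0))
\end{verbatim}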
\subsection{Convergence theorem}
The basic result about the perceptron is that, if the training data
$\dataTrain$ is linearly separable, then the perceptron algorithm is
guaranteed to find a linear separator, for large
enough $\tau.$ \note{If the training data is
{\em not} linearly separable, the algorithm will not be able to tell
you for sure, in finite time, that it is not linearly separable.
There are other algorithms that can test for linear separability
with run-times $O(n^{d/2})$ or $O(d^{2n})$ or $O(n^{d-1}\log{n})$.}
We will more specifically characterize the linear separability of the
dataset by the \emph{margin} of the separator. We'll start by
defining the margin of a point with respect to a hyperplane.
First, recall that the signed distance from the
hyperplane $\theta, \theta_0$ to a point $x$ is
\[ \frac{\theta^Tx + \theta_0}{\norm{\theta}}\;\;. \]
Then, we'll define the {\em margin} of a {\em labeled point} $(x, y)$
with respect to hyperplane $\theta, \theta_0$ to be
\[ y \cdot \frac{\theta^Tx + \theta_0}{\norm{\theta}}\;\;. \]
This quantity will be positive if and only if the point $x$ is
classified as $y$ by the linear classifier represented by this
hyperplane.
\question{What sign does the margin have if the point is incorrectly
classified? Be sure you can explain why.}
Now, the {\em margin} of a {\em dataset} $\dataTrain$ with respect to the hyperplane
$\theta, \theta_0$ is the {\em minimum} margin of any point with respect
to $\theta, \theta_0$:
\[\min_i \left(\ex{y}{i} \cdot \frac{\theta^T \ex{x}{i} + \theta_0}{\norm{\theta}}\right)\;\;.\]
The margin is positive if and only if all of the points in the
dataset are classified correctly. In that case (only!) it represents
the distance from the hyperplane to the closest point.
\begin{examplebox}{\bf Example:}
Let $h$ be the linear classifier defined by $\theta = \protect\twodcol{1}{-1}, \theta_0 = 1$.
\bigskip
\noindent
The diagram below shows several points classified by $h$, one of which is misclassified. We compute the margin for each point:
\begin{center}
\begin{tikzpicture}
\draw [thin, gray!40] (-3,-3) grid (4,4);
\draw [<->] (-3,0) -- (4,0);
\draw [<->] (0,-3) -- (0,4);
\draw [ultra thick] (-3, -2) -- (3, 4)
node [above] {\large$\theta^Tx + \theta_0 = 0$};
\draw [thick,teal,-latex] (0, 1) -- (0.9, 0.1)
node [black,below right] {$\theta$};
\node [above right,xshift=.3em] at (1,3) {\large$\ex{x}{1}$};
\node [above right,xshift=.3em] at (2.5, 1.5) {\large$\ex{x}{2}$};
\node [above right,xshift=.3em] at (-1.5, 1.5) {\large$\ex{x}{3}$};
\foreach \Point in {(1, 3), (2.5, 1.5)}{
\pic at \Point {plus};
}
\foreach \Point in {(-1.5, 1.5)}{
\pic at \Point {minus};
}
\end{tikzpicture}
$$ \ex{y}{1} \cdot \frac{\theta^T \ex{x}{1} + \theta_0}{\norm{\theta}} = 1 \cdot \frac{-2 + 1}{\sqrt{2}} = -\frac{\sqrt{2}}{2} $$
$$ \ex{y}{2} \cdot \frac{\theta^T \ex{x}{2} + \theta_0}{\norm{\theta}} = 1 \cdot \frac{1 + 1}{\sqrt{2}} = \sqrt{2} $$
$$ \ex{y}{3} \cdot \frac{\theta^T \ex{x}{3} + \theta_0}{\norm{\theta}} = -1 \cdot \frac{-3 + 1}{\sqrt{2}} = \sqrt{2} $$
\end{center}
Note that since point $\ex{x}{1}$ is misclassified, its margin is negative. Thus the margin for the whole data set is given by $-\frac{\sqrt{2}}{2}$.
\end{examplebox}
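A sketch that reproduces these margins numerically:
\begin{verbatim}
import numpy as np

th, th0 = np.array([1., -1.]), 1.0
X = np.array([[1., 3.], [2.5, 1.5], [-1.5, 1.5]])
y = np.array([1, 1, -1])
margins = y * (X @ th + th0) / np.linalg.norm(th)
print(margins)        # approx [-0.707  1.414  1.414]
print(margins.min())  # approx -0.707, i.e., -sqrt(2)/2
\end{verbatim}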
\begin{theorem}[Perceptron Convergence]
For simplicity, we consider the case where the linear separator must
pass through the origin, and $\tau$ is infinite. If the following conditions hold:
\begin{enumerate}[(a)]
\item there exist $\theta^*$ and $\gamma$ such that $\gamma > 0$, and
$\ex{y}{i} \frac{\theta^{*T}\ex{x}{i}}{\norm{\theta^*}}
\geq \gamma$
for all $i = 1, \ldots, n$,
\item all the examples have bounded magnitude: $\norm{\ex{x}{i}} \leq R$
for all $i = 1, \ldots, n$,
\end{enumerate}
then the perceptron algorithm will make at most $\left(\frac{R}{\gamma}
\right)^2$ updates to its starting hypothesis. At this point, its hypothesis
will be a linear separator of the data.
%\note{this convergence guarantee is independent of $d$ and $n$!!}
\end{theorem}
\begin{proof}
We initialize $\ex{\theta}{0} = 0$, and let $\ex{\theta}{k}$ define our
hyperplane after the perceptron algorithm has made $k$ updates to its
starting hypothesis.
We are going to think about the angle between the hypothesis we have
now, $\ex{\theta}{k}$, and the assumed good separator $\theta^*$.
Since they both go through the origin, if we can show that the angle
between them decreases usefully with each update, then we will get
close to that separator.
So, let's think about the $\cos$ of the angle between them, and
recall, by the definition of dot product:
\[\cos\left(\ex{\theta}{k}, \theta^*\right)
= \frac{\ex{\theta}{k} \cdot \theta^*}{
\norm{\theta^*}\norm{\ex{\theta}{k}}}\]
We'll divide this up into two factors,
\begin{equation}
\cos\left(\ex{\theta}{k}, \theta^*\right)
= \left(\frac{\ex{\theta}{k} \cdot
\theta^*}{\norm{\theta^*}}\right)\cdot
\left(\frac{1}{\norm{\ex{\theta}{k}}}\right)\;\;,
\label{eq:cos}
\end{equation}
and start by focusing on the first factor.
Without loss of generality, assume that the $k^{\rm th}$
update occurs on the $i^{\rm th}$ example $\left(\ex{x}{i}, \ex{y}{i}\right)$.
\begin{align*}
\frac{\ex{\theta}{k} \cdot \theta^*}{\norm{\theta^*}}
& = \frac{\left(\ex{\theta}{k-1} + \ex{y}{i}\ex{x}{i}\right)\cdot \theta^*}
{\norm{\theta^*}} \\
& = \frac{\ex{\theta}{k-1}\cdot \theta^*}{\norm{\theta^*}} + \frac{
\ex{y}{i}\ex{x}{i}\cdot\theta^*}{\norm{\theta^*}} \\
& \geq \frac{\ex{\theta}{k-1}\cdot \theta^*}{\norm{\theta^*}} + \gamma \\
& \geq k\gamma
\end{align*}
where we have first applied the margin condition from $(a)$ and then
applied simple induction.
Now, we'll look at the second factor in Eq.~\ref{eq:cos}.
We note that since $\left(\ex{x}{i},\ex{y}{i}\right)$ is classified
incorrectly, $\ex{y}{i}\left({\ex{\theta}{k-1}}^T\ex{x}{i}\right) \leq 0$.
Thus,
\begin{align*}
\norm{\ex{\theta}{k}}^2 & = \norm{\ex{\theta}{k-1} + \ex{y}{i}\ex{x}{i}}^2 \\
& = \norm{\ex{\theta}{k-1}}^2 + 2\ex{y}{i}
{\ex{\theta}{k-1}}^T\ex{x}{i} + \norm{\ex{x}{i}}^2 \\
& \leq \norm{\ex{\theta}{k-1}}^2 + R^2 \\
& \leq kR^2
\end{align*}
where we have additionally applied the assumption from $(b)$ and then
again used simple induction.
Returning to the definition of the dot product, we have
\[\cos\left(\ex{\theta}{k}, \theta^*\right)
= \frac{\ex{\theta}{k} \cdot \theta^*}{\norm{\ex{\theta}{k}}
\norm{\theta^*}}
= \left(\frac{\ex{\theta}{k} \cdot \theta^*}{\norm{\theta^*}}\right)
\frac{1}{\norm{\ex{\theta}{k}}}
\geq (k\gamma)\cdot \frac{1}{\sqrt{k}R} = \sqrt{k}\cdot\frac{\gamma}{R}\]
Since the value of the cosine is at most 1, we have
\begin{align*}
1 & \geq \sqrt{k} \cdot \frac{\gamma}{R} \\
k & \leq \left(\frac{R}{\gamma}\right)^2.
\end{align*}
\end{proof}
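A sketch of an empirical check of this bound, on the one-dimensional
dataset from Section~\ref{sec-offset} augmented to pass through the
origin; $\gamma$ is computed from the known separator
$\theta^* = [-1, 2.5]^T$, and the helper name is ours:
\begin{verbatim}
import numpy as np

def perceptron_count_updates(X, y, tau=1000):
    # through-origin perceptron that also counts its updates
    th, k = np.zeros(X.shape[1]), 0
    for _ in range(tau):
        changed = False
        for i in range(X.shape[0]):
            if y[i] * (th @ X[i]) <= 0:
                th, k, changed = th + y[i] * X[i], k + 1, True
        if not changed:
            break
    return th, k

X = np.array([[1., 1.], [2., 1.], [3., 1.], [4., 1.]])
y = np.array([1, 1, -1, -1])
th, k = perceptron_count_updates(X, y)
th_star = np.array([-1., 2.5])
gamma = np.min(y * (X @ th_star) / np.linalg.norm(th_star))
R = np.max(np.linalg.norm(X, axis=1))
print(k, (R / gamma) ** 2)   # observed updates <= theoretical bound
\end{verbatim}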
This result endows the margin $\gamma$ of $\dataTrain$ with an
operational meaning: when using the Perceptron algorithm for
classification, at most $(R/\gamma)^2$ updates will be
made, where $R$ is an upper bound on the magnitude of the training
vectors.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "top"
%%% End: