An ANN is based on a collection of connected units or nodes called artificial neurons which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it.
-
The above illustration is a 2-layer neural network; we generally exclude the input layer when counting the number of layers.
-
$$ \begin{align} a^{[0]} &= X \label{eq1} \\ a^{[1]} &= \begin{bmatrix} a_1^{[1]} \\ a_2^{[1]} \\ a_3^{[1]} \\ a_4^{[1]} \end{bmatrix} \in \mathbb{R}^{\text{\#hidden\_units} \times 1} \\ a^{[2]} &= \hat{y} \label{eq3} \end{align} $$
-
The notation
$a^{[l]}_i$ denotes the $i^\text{th}$ node of the $l^{\text{th}}$ layer. -
This image shows a particular neuron and the computations that occur in it :
-
For each of the neurons in the 2-layer NN given at the top of this page, we can write the following set of expressions :
-
$$ \begin{align} z_1^{[1]} &= w_1^{[1]T}x + b_1^{[1]}, & \quad a_1^{[1]} &= \sigma{(z_1^{[1]})} \\ z_2^{[1]} &= w_2^{[1]T}x + b_2^{[1]}, & \quad a_2^{[1]} &= \sigma{(z_2^{[1]})} \\ z_3^{[1]} &= w_3^{[1]T}x + b_3^{[1]}, & \quad a_3^{[1]} &= \sigma{(z_3^{[1]})} \\ z_4^{[1]} &= w_4^{[1]T}x + b_4^{[1]}, & \quad a_4^{[1]} &= \sigma{(z_4^{[1]})} \end{align} $$
-
Computing the above expressions in a `for` loop can be expensive, so we use the vectorised form derived from : -
$$ \begin{align} z^{[1]} &= \underbrace{\begin{bmatrix} \cdots & w_1^{[1]T} & \cdots \\ \cdots & w_2^{[1]T} & \cdots \\ \cdots & w_3^{[1]T} & \cdots \\ \cdots & w_4^{[1]T} & \cdots \end{bmatrix}}_{W^{[1]} \in \mathbb{R}^{4 \times 3}} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \underbrace{\begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \\ b_4^{[1]} \end{bmatrix}}_{b^{[1]} \in \mathbb{R}^{4 \times 1}} = \begin{bmatrix} z_1^{[1]} \\ z_2^{[1]} \\ z_3^{[1]} \\ z_4^{[1]} \end{bmatrix} \\
a^{[1]} &= \begin{bmatrix} a_1^{[1]} \\ a_2^{[1]} \\ a_3^{[1]} \\ a_4^{[1]} \end{bmatrix} = \sigma{(z^{[1]})} \\
&\text{or} \\
\underbrace{z^{[1]}}_{(4,1)} &= \underbrace{W^{[1]}}_{(4,3)}\underbrace{x}_{(3,1)} + \underbrace{b^{[1]}}_{(4,1)} \\
\underbrace{a^{[1]}}_{(4,1)} &= \underbrace{\sigma{(z^{[1]})}}_{(4,1)} \\
\underbrace{z^{[2]}}_{(1,1)} &= \underbrace{W^{[2]}}_{(1,4)}\underbrace{a^{[1]}}_{(4,1)} + \underbrace{b^{[2]}}_{(1,1)} \\
\underbrace{a^{[2]}}_{(1,1)} &= \underbrace{\sigma{(z^{[2]})}}_{(1,1)} \end{align} $$
-
From $\eqref{eq1}$ and $\eqref{eq3}$, the vectorised implementation looks as follows : -
$$ \begin{align} z^{[1]} &= W^{[1]}a^{[0]} + b^{[1]} \\ a^{[1]} &= \sigma{(z^{[1]})} \\ z^{[2]} &= W^{[2]}a^{[1]} + b^{[2]} \\ a^{[2]} &= \sigma{(z^{[2]})} = \hat{y} \end{align} $$
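As a concrete (hedged) NumPy sketch of this single-example forward pass, with the 3-4-1 layer sizes of the example network above and randomly initialised placeholder weights:

```python
import numpy as np

np.random.seed(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.random.randn(3, 1)            # one input example, shape (3, 1)
W1 = np.random.randn(4, 3) * 0.01    # W[1], shape (4, 3)
b1 = np.zeros((4, 1))                # b[1], shape (4, 1)
W2 = np.random.randn(1, 4) * 0.01    # W[2], shape (1, 4)
b2 = np.zeros((1, 1))                # b[2], shape (1, 1)

z1 = np.dot(W1, x) + b1              # shape (4, 1)
a1 = sigmoid(z1)                     # shape (4, 1)
z2 = np.dot(W2, a1) + b2             # shape (1, 1)
a2 = sigmoid(z2)                     # y_hat, shape (1, 1)
```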
-
-
More generally,
-
$$ \begin{align} Z^{[l]} &= W^{[l]}A^{[l-1]} + b^{[l]} \label{eq19} \\ A^{[l]} &= g^{[l]}{(Z^{[l]})} \label{eq20} \end{align} \\ \text{where } g(z) \text{ represents some activation function} $$
-
This involves computing the activations and output of all layers in the forward pass :
-
for i=1 to m:
$$ \begin{align} z^{[1](i)} &= W^{[1]}a^{[0](i)} + b^{[1]} \\ a^{[1](i)} &= \sigma{(z^{[1](i)})} \\ z^{[2](i)} &= W^{[2]}a^{[1](i)} + b^{[2]} \\ a^{[2](i)} &= \sigma{(z^{[2](i)})} = \hat{y}^{(i)} \end{align} $$ Recall that, $$ X = \begin{bmatrix} \vdots & \vdots & \cdots & \vdots \\ X^{(1)} & X^{(2)} & \cdots & X^{(m)} \\ \vdots & \vdots & \cdots & \vdots \end{bmatrix} \in \mathbb{R}^{n \times m} $$ -
By stacking all the column vectors we get the vectorised form over all training examples :
- $$ \begin{align} Z^{[1]} &= W^{[1]}A^{[0]} + b^{[1]} \\ A^{[1]} &= \sigma{(Z^{[1]})} \\ Z^{[2]} &= W^{[2]}A^{[1]} + b^{[2]} \\ A^{[2]} &= \sigma{(Z^{[2]})} = \hat{y} \end{align} \\ \text{where,} \\ Z^{[1]} = \begin{bmatrix} \vdots & \vdots & \cdots & \vdots \\ z^{[1](1)} & z^{[1](2)} & \cdots & z^{[1](m)} \\ \vdots & \vdots & \cdots & \vdots \end{bmatrix} \in \mathbb{R}^{4 \times m} \\ \text{and,} \\ A^{[1]} = \begin{bmatrix} \vdots & \vdots & \cdots & \vdots \\ a^{[1](1)} & a^{[1](2)} & \cdots & a^{[1](m)} \\ \vdots & \vdots & \cdots & \vdots \end{bmatrix} \in \mathbb{R}^{4 \times m} $$
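Extending the earlier sketch to all $m$ examples at once, the only change is that the input becomes a matrix and $b$ is broadcast across its columns (sizes again assume the 3-4-1 network; values are random placeholders):

```python
import numpy as np

np.random.seed(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

m = 5
X = np.random.randn(3, m)                # A[0]: each column is one training example
W1, b1 = np.random.randn(4, 3) * 0.01, np.zeros((4, 1))
W2, b2 = np.random.randn(1, 4) * 0.01, np.zeros((1, 1))

Z1 = np.dot(W1, X) + b1                  # (4, m); b1 is broadcast across the m columns
A1 = sigmoid(Z1)                         # (4, m)
Z2 = np.dot(W2, A1) + b2                 # (1, m)
A2 = sigmoid(Z2)                         # (1, m): one prediction per column / example
```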
-
More generally, forward propagation can be written as :
-
$$ \begin{align} Z^{[1]} &= W^{[1]}A^{[0]} + b^{[1]} \\ A^{[1]} &= g^{[1]}{(Z^{[1]})} \\ Z^{[2]} &= W^{[2]}A^{[1]} + b^{[2]} \\ A^{[2]} &= g^{[2]}{(Z^{[2]})} = \hat{y} \end{align} \\ \text{where } g(z) \text{ represents some activation function} $$
-
Sigmoid :
-
Activation :
$g(z) = \sigma{(z)} = \frac{1}{1+e^{-z}}$ -
Derivative :
$g'(z) = \sigma{(z)}(1 - \sigma{(z)}) = \frac{1}{1+e^{-z}}(1 - \frac{1}{1+e^{-z}})$ -
Generally used only in the final layer, when the target variable
$y \in \{0,1\}$ , say, in case of predicting probabilities.
-
-
Tanh :
-
Activation :
$g(z) = tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ -
Derivative :
$g'(z) = 1 - tanh^2(z) = 1 - (\frac{e^z - e^{-z}}{e^z + e^{-z}})^2$ -
One of the downsides of the sigmoid and tanh functions is that if
$|z|$ is very large then the gradient of the activation function becomes very small, and this can slow down gradient descent.
-
-
ReLU (Rectified Linear Unit) :
-
Activation :
$g(z) = max(0,z)$ -
Derivative : $g'(z) = \begin{cases} 0, & \text{if } z \lt 0 \\ 1, & \text{if } z \ge 0\end{cases}$
-
The derivative at
$z = 0$ is not defined, but in practice it is simply assigned a value (e.g. $0$ or $1$), since $z$ is almost never exactly $0$.
-
-
Leaky ReLU :
- Activation :
$g(z) = max(0.01z, z)$ - Derivative : $g'(z) = \begin{cases} 0.01, & \text{if } z \lt 0 \\ 1, & \text{if } z \ge 0\end{cases}$
- A generalisation is the parametric ReLU, where $a$ is a learnable parameter and
$g(z) = max(az, z)$ with $g'(z) = \begin{cases} a, & \text{if } z \lt 0 \\ 1, & \text{if } z \ge 0\end{cases}$
-
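A minimal NumPy sketch of these activations and their derivatives (the undefined derivative at $z = 0$ for ReLU and leaky ReLU is assigned to the $z \ge 0$ branch here, as discussed above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_prime(z):                       # np.tanh is the activation itself
    return 1 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):                       # z == 0 is assigned to the z >= 0 branch
    return (z >= 0).astype(float)

def leaky_relu(z, a=0.01):
    return np.maximum(a * z, z)

def leaky_relu_prime(z, a=0.01):
    return np.where(z >= 0, 1.0, a)
```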
Why do we need activation functions which are non-linear?
-
Suppose we have no activation function : $$ \begin{align} Z^{[1]} &= W^{[1]}A^{[0]} + b^{[1]} \\ A^{[1]} &= Z^{[1]} \\ Z^{[2]} &= W^{[2]}A^{[1]} + b^{[2]} \\ A^{[2]} &= Z^{[2]} \end{align} $$
-
Then we can show that, no matter how many hidden layers there are, the NN will essentially fit a linear function (Linear Regression) :
-
$$ \begin{align} Z^{[2]} &= W^{[2]}A^{[1]} + b^{[2]} \\ &= W^{[2]}Z^{[1]} + b^{[2]} \\ &= W^{[2]}(W^{[1]}A^{[0]} + b^{[1]}) + b^{[2]} \\ &= W^{[2]}W^{[1]}A^{[0]} + (W^{[2]}b^{[1]} + b^{[2]}) \\ &= W'X + b' \end{align} $$
-
Similarly, if all the activations in the intermediate layers were linear and the activation of the output layer was a sigmoid, then the NN would be equivalent to Logistic Regression.
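A quick numerical check of the argument above: composing two purely linear "layers" is exactly one linear map (sizes below are arbitrary):

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(3, 5)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(1, 4), np.random.randn(1, 1)

# two "layers" with the identity as activation
A2 = np.dot(W2, np.dot(W1, X) + b1) + b2

# a single equivalent linear layer
W_prime = np.dot(W2, W1)
b_prime = np.dot(W2, b1) + b2
assert np.allclose(A2, np.dot(W_prime, X) + b_prime)
```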
-
-
Dimensionality sanity check :
-
$$ \begin{align} W^{[l]} &\quad\text{and}\quad &dW^{[l]} &\in (n^{[l]} \times n^{[l-1]}) \\ b^{[l]} &\quad\text{and}\quad &db^{[l]} &\in (n^{[l]} \times 1) \\ Z^{[l]} &\quad\text{and}\quad &dZ^{[l]} &\in (n^{[l]} \times 1) \\ A^{[l]} &\quad\text{and}\quad &dA^{[l]} &\in (n^{[l]} \times 1) \end{align} \\ \text{where } n^{[l]} \text{ denotes the number of units in layer } l $$
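These checks are easy to automate; a small hedged helper, assuming a `parameters` dict keyed `'W1'`, `'b1'`, ... as in the implementation further below:

```python
def check_shapes(parameters, n):
    """n[l] = number of units in layer l; n[0] is the input dimension."""
    L = len(n) - 1
    for l in range(1, L + 1):
        assert parameters['W' + str(l)].shape == (n[l], n[l - 1])
        assert parameters['b' + str(l)].shape == (n[l], 1)

# e.g. for the 3-4-1 example network:
# check_shapes(parameters, n=[3, 4, 1])
```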
-
The computational graph for a neuron, with some activation function
$g(z)$ , looks like this : -
$$ \begin{matrix}\boxed{x} \\ \boxed{w} \\ \boxed{b} \end{matrix} \color{red}{\xleftarrow[db]{dw}} \boxed{\boxed{z = w^Tx + b} \color{red}{\xleftarrow{dz}} \boxed{a = g(z)}} \color{red}{\xleftarrow{da}} \boxed{L(a,y)} $$
-
Accordingly,
$da$ and $dz$ obtained from the chain rule are as follows : -
$$ \begin{align} da &= -\frac{y}{a} + \frac{1-y}{1-a} \\ dz &= da \times \frac{\partial{a}}{\partial{z}} \\ &= da \cdot g'(z) \end{align} $$
-
The computational graph for our network is :
-
$$ \begin{matrix}\boxed{X} \\ \boxed{W^{[1]}} \\ \boxed{b^{[1]}} \end{matrix} \color{red}{\xleftarrow[db^{[1]}]{dW^{[1]}}} \boxed{Z^{[1]} = W^{[1]}X + b^{[1]}} \color{red}{\xleftarrow{dZ^{[1]}}} \boxed{A^{[1]} = g^{[1]}(Z^{[1]})} \color{red}{\xleftarrow{dA^{[1]}}} \\ \boxed{Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}} \color{red}{\xleftarrow{dZ^{[2]}}} \boxed{A^{[2]} = g^{[2]}(Z^{[2]})} \color{red}{\xleftarrow{dA^{[2]}}} \boxed{L(a,y)} $$
-
Applying the same chain rule logic, we get the following vectorised form:
$$ \begin{align} dZ^{[2]} &= \begin{cases} A^{[2]} - y, & \text{if } g^{[2]}(Z^{[2]}) = \sigma{(Z^{[2]})} \\ dA^{[2]} \cdot g'^{[2]}(Z^{[2]}), & \text{otherwise} \end{cases} \\
dW^{[2]} &= dZ^{[2]}\frac{\partial{Z^{[2]}}}{\partial{W^{[2]}}} \\ &= \frac{1}{m}dZ^{[2]}A^{[1]T} \\
db^{[2]} &= dZ^{[2]}\frac{\partial{Z^{[2]}}}{\partial{b^{[2]}}} \\ &= \frac{1}{m}dZ^{[2]} \\
dZ^{[1]} &= dZ^{[2]}\frac{\partial{Z^{[2]}}}{\partial{A^{[1]}}}\frac{\partial{A^{[1]}}}{\partial{Z^{[1]}}} \\ &= W^{[2]T}dZ^{[2]} * g'^{[1]}(Z^{[1]}), \quad \text{where } * \text{ denotes the element-wise product} \\
dW^{[1]} &= dZ^{[1]}\frac{\partial{Z^{[1]}}}{\partial{W^{[1]}}} \\ &= \frac{1}{m}dZ^{[1]}A^{[0]T} \\
db^{[1]} &= dZ^{[1]}\frac{\partial{Z^{[1]}}}{\partial{b^{[1]}}} \\ &= \frac{1}{m}dZ^{[1]} \end{align} $$
-
```python
dZ2 = A2 - Y
dW2 = np.dot(dZ2, A1.T) / m
db2 = np.sum(dZ2, keepdims=True, axis=1) / m       # keepdims avoids rank one matrices
dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))    # element-wise product (tanh derivative)
dW1 = np.dot(dZ1, X.T) / m
db1 = np.sum(dZ1, keepdims=True, axis=1) / m
```
-
Don't initialise
$W$ to zeroes, as the neural network does not end up learning. I.e., $$ \begin{align} &\text{If, } \\ W^{[1]} &= \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} \\ &\text{then, } \\ a^{[1]}_1 &= a^{[1]}_2 \\ dz^{[1]}_1 &= dz^{[1]}_2 \\ dW^{[1]} &= \begin{bmatrix} u & v \\ u & v \end{bmatrix} \\ &\text{and,} \\ W^{[1]} &= W^{[1]} - \alpha dW^{[1]} = \begin{bmatrix} x & y \\ x & y \end{bmatrix} \end{align} $$ We want the weights $W$ to be different for every node. -
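A hedged sketch of this symmetry problem; zero initialisation is the special case of initialising every weight to the same constant, and with any such constant both hidden units receive identical gradient rows, so they can never become different:

```python
import numpy as np

np.random.seed(2)
m = 10
X = np.random.randn(2, m)
y = (np.random.rand(1, m) > 0.5).astype(float)

c = 0.5                                       # any constant; 0 gives the all-zeros case
W1, b1 = np.full((2, 2), c), np.zeros((2, 1))
W2, b2 = np.full((1, 2), c), np.zeros((1, 1))

A1 = np.tanh(np.dot(W1, X) + b1)              # both rows of A1 are identical
A2 = 1 / (1 + np.exp(-(np.dot(W2, A1) + b2)))

dZ2 = A2 - y
dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)       # both rows of dZ1 are identical
dW1 = np.dot(dZ1, X.T) / m

print(dW1)                                    # rows are identical: [[u, v], [u, v]]
assert np.allclose(dW1[0], dW1[1])
```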
So just initialise
$W$ to some random small value, like `W1 = np.random.randn(4, 3) * 0.01`.
-
Why
`0.01`
and not `100`
? Well, if $W$ is large then $Z$ is very large. For sigmoid/tanh activations this corresponds to a very small slope of the activation function at $Z$, so gradient descent may be slow.
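A quick illustration with a single tanh unit (random placeholder values): scaling the weights by `100` pushes $z$ into the saturated region where the slope of tanh is essentially zero, while `0.01` keeps it close to 1:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(3, 1)

for scale in (0.01, 100):
    w = np.random.randn(1, 3) * scale
    z = np.dot(w, x).item()
    grad = 1 - np.tanh(z) ** 2        # slope of tanh at z
    print(scale, round(z, 3), grad)   # small scale -> grad near 1; large scale -> grad near 0
```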
-
```python
import numpy as np

def sigmoid(z):
    # sigmoid activation, needed by forward_prop
    return 1.0 / (1.0 + np.exp(-z))

def initialise_params(nx, nh, ny):
    W1 = np.random.randn(nh, nx) * 0.01   # randomly initialise Ws to some small value
    b1 = np.zeros((nh, 1))
    W2 = np.random.randn(ny, nh) * 0.01
    b2 = np.zeros((ny, 1))
    parameters = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
    return parameters

def forward_prop(X, parameters):
    Z1 = np.dot(parameters['W1'], X) + parameters['b1']
    A1 = np.tanh(Z1)
    Z2 = np.dot(parameters['W2'], A1) + parameters['b2']
    A2 = sigmoid(Z2)
    cache = {'Z1': Z1, 'A1': A1, 'Z2': Z2, 'A2': A2}
    return cache

def compute_cost(yhat, y):
    m = y.shape[1]
    logprobs = np.multiply(np.log(yhat), y) + np.multiply(np.log(1 - yhat), 1 - y)
    cost = - np.sum(logprobs) / m
    # cost = -(1.0/m) * np.sum(y*np.log(yhat) + (1-y)*np.log(1-yhat))
    # cost = -(1.0/m) * np.sum(np.dot(np.log(yhat), y.T) + np.dot(np.log(1-yhat), (1-y).T))
    cost = np.squeeze(cost)   # removes 1D entries from the shape of the array
    return cost

def back_prop(parameters, cache, X, y):
    m = X.shape[1]
    W2 = parameters['W2']
    A1 = cache['A1']
    A2 = cache['A2']
    dZ2 = A2 - y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, keepdims=True, axis=1) / m      # keepdims avoids rank one matrices
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))   # element-wise product (tanh derivative)
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, keepdims=True, axis=1) / m
    grads = {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}
    return grads

def update_params(parameters, grads, learning_rate=1.2):
    W1 = parameters['W1']
    W2 = parameters['W2']
    b1 = parameters['b1']
    b2 = parameters['b2']
    dW1 = grads['dW1']
    dW2 = grads['dW2']
    db1 = grads['db1']
    db2 = grads['db2']
    W1 -= learning_rate * dW1
    W2 -= learning_rate * dW2
    b1 -= learning_rate * db1
    b2 -= learning_rate * db2
    parameters = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
    return parameters

def model(X, y, nh, iterations=10000):
    nx = X.shape[0]
    ny = y.shape[0]
    np.random.seed(3)
    parameters = initialise_params(nx=nx, nh=nh, ny=ny)
    # Loop for GD
    for i in range(0, iterations):
        # Forward prop
        cache = forward_prop(X, parameters)
        # Cost
        cost = compute_cost(y=y, yhat=cache['A2'])
        # Back prop
        grads = back_prop(X=X, cache=cache, parameters=parameters, y=y)
        # Update params
        parameters = update_params(grads=grads, parameters=parameters)
        if i % 1000 == 0:
            print(cost)
    return parameters
```
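A hedged usage example on synthetic (random) data, relying on the functions defined above; `nh=4` matches the example network:

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(3, 200)                        # 200 examples with 3 features each
y = (np.random.rand(1, 200) > 0.5).astype(float)   # binary labels

parameters = model(X, y, nh=4, iterations=5000)
yhat = forward_prop(X, parameters)['A2']           # predicted probabilities
predictions = (yhat > 0.5).astype(float)
```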
-
- $L$ denotes the total number of layers in the network
- $n^{[l]}$ denotes the number of neurons in layer $l$
- Equations $\eqref{eq19}$ and $\eqref{eq20}$ give the most general form for calculating $Z^{[l]}$ and $A^{[l]}$
- In $\eqref{eq19}$, note that $b^{[l]}$ is broadcast across the $m$ columns. So, effectively $b^{[1]} \in (n^{[1]}, m)$, and $Z^{[l]}, A^{[l]}, dZ^{[l]}, dA^{[l]} \in (n^{[l]}, m)$
From circuit theory, it follows (informally) that a shallow network with a single hidden layer may require an exponential number of neurons to compute functions that a deeper network can compute with far fewer neurons per layer.
-
Vectorised Forward prop is given by :
-
$$ \begin{align} Z^{[l]} &= W^{[l]}A^{[l-1]} + b^{[l]} \\ A^{[l]} &= g^{[l]}(Z^{[l]}) \end{align} $$
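In code this is just a loop over the layers; a minimal sketch, assuming the `parameters` dict convention used above, ReLU for the hidden layers and a sigmoid output (both activation choices are illustrative):

```python
def forward_prop_deep(X, parameters, L):
    """Assumes parameters holds 'W1', 'b1', ..., 'WL', 'bL' as in the code above."""
    A = X                                               # A[0]
    caches = []
    for l in range(1, L + 1):
        W, b = parameters['W' + str(l)], parameters['b' + str(l)]
        Z = np.dot(W, A) + b                            # Z[l] = W[l] A[l-1] + b[l]
        A = sigmoid(Z) if l == L else np.maximum(0, Z)  # g[l]: ReLU hidden, sigmoid output
        caches.append((Z, A))
    return A, caches
```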
-
Back prop is given by :
-
$$ \begin{align} dZ^{[l]} &= dA^{[l]} * g'^{[l]}(Z^{[l]}) \\ dW^{[l]} &= \frac{1}{m} dZ^{[l]}A^{[l-1]T} \\ db^{[l]} &= \frac{1}{m}\,\text{np.sum}(dZ^{[l]}, \text{axis}=1, \text{keepdims}=\text{True}) \\ dA^{[l-1]} &= W^{[l]T}dZ^{[l]} \end{align} $$
-
```python
dZ = np.multiply(dA, activation_differential(Z))
dW = np.dot(dZ, A_prev.T) / m
db = np.sum(dZ, axis=1, keepdims=True) / m
dA_prev = np.dot(W.T, dZ)
```
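Strung together over all layers, back prop is the mirror-image loop; a hedged sketch where the assumed `caches[l-1] = (Z, A_prev, W)` layout and `activation_primes` list are illustrative, not part of the original notes:

```python
def back_prop_deep(dAL, caches, activation_primes):
    """caches[l-1] = (Z, A_prev, W) for layer l; activation_primes[l-1] = g'[l]."""
    grads = {}
    dA = dAL
    L = len(caches)
    for l in reversed(range(1, L + 1)):
        Z, A_prev, W = caches[l - 1]
        m = A_prev.shape[1]
        dZ = np.multiply(dA, activation_primes[l - 1](Z))
        grads['dW' + str(l)] = np.dot(dZ, A_prev.T) / m
        grads['db' + str(l)] = np.sum(dZ, axis=1, keepdims=True) / m
        dA = np.dot(W.T, dZ)                   # this becomes dA[l-1] for the next iteration
    return grads
```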
- What are hyperparameters ?
  - Learning rate $\alpha$
  - Number of iterations
  - Number of hidden layers $L$
  - Number of neurons in each layer $n^{[1]}, \cdots, n^{[l]}$
  - Choice of activation function
  - Momentum
  - Mini-batch size
  - Regularisation parameters
  - $\cdots$
- The idea is to play around with the hyperparameters so that the cost decreases with the number of iterations