import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy
model = Sequential([
    Dense(units=25, activation='sigmoid'),
    Dense(units=15, activation='sigmoid'),
    Dense(units=1, activation='sigmoid')
])
model.compile(loss=BinaryCrossentropy())
model.fit(X, Y, epochs=100)
tip) `epochs` is the number of complete passes through the training data that gradient descent makes
- Specify how to compute the output given the input $\vec{x}$ and parameters $\vec{w}, b$ (define the model): $f_{\vec{w}, b}(\vec{x}) = ?$
  model = Sequential([
      Dense(units=25, activation='sigmoid'),
      Dense(units=15, activation='sigmoid'),
      Dense(units=1, activation='sigmoid')
  ])
- Specify the loss and cost functions (the binary cross-entropy loss used here is written out right after this list)
  Loss function: $L(f_{\vec{w}, b}(\vec{x}), y)$
  Cost function: $J(\vec{w}, b) = \dfrac{1}{m}\displaystyle\sum_{i = 1}^{m}{L(f_{\vec{w}, b}(\vec{x}^{(i)}), y^{(i)})}$
  model.compile(loss=BinaryCrossentropy())
- Train on the data to minimize $J(\vec{w}, b)$
  This runs the gradient descent algorithm for every unit of every layer to find the parameters ($W$ and $b$) that minimize the cost.
  model.fit(X, Y, epochs=100)
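For reference, the `BinaryCrossentropy` loss used above is the logistic loss, applied to a single training example:

$$L(f_{\vec{w}, b}(\vec{x}), y) = -y \log\left(f_{\vec{w}, b}(\vec{x})\right) - (1 - y) \log\left(1 - f_{\vec{w}, b}(\vec{x})\right)$$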
Most of the time there is a fairly natural choice for the last layer's activation based on the output type. For example, if it's binary classification it would be Sigmoid, and if it's regression it would be either ReLU (when the output can't be negative) or Linear.
The ReLU activation has replaced the Sigmoid as the most popular activation function for the following reasons:
- ReLU is faster to compute, as it's simpler than Sigmoid
- Sigmoid goes flat in two places (both ends) whereas ReLU goes flat in just one place ($z \leq 0$), so gradient descent performs a lot slower with Sigmoid
- As a result, ReLU learns faster when used in the hidden layers
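For reference, the two activations being compared are:

$$g_{\text{ReLU}}(z) = \max(0, z) \qquad\qquad g_{\text{sigmoid}}(z) = \dfrac{1}{1 + e^{-z}}$$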
Here is the reason why a neural network with just Linear activation functions won't work:
Suppose we have a two-layer neural network with one Linear neuron in each layer:
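Writing out the two layers (using $w^{[1]}, b^{[1]}$ and $w^{[2]}, b^{[2]}$ for the parameters of the first and second neuron):

$$a^{[1]} = w^{[1]} x + b^{[1]}$$
$$a^{[2]} = w^{[2]} a^{[1]} + b^{[2]} = w^{[2]} \left(w^{[1]} x + b^{[1]}\right) + b^{[2]} = \underbrace{\left(w^{[2]} w^{[1]}\right)}_{w} x + \underbrace{\left(w^{[2]} b^{[1]} + b^{[2]}\right)}_{b}$$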
So the result would be just another Linear function.
Similarly, using Linear activation functions in all hidden layers and a Sigmoid activation in the output layer would result in a simple Sigmoid function (equivalent to logistic regression).
tip) Don't use just Linear activation functions in hidden layers.
tip) Using ReLU for all hidden layers works just fine.
Given an input, Softmax calculates the probability of that input belonging to each of the existing categories.
The Softmax output for category $j$ (out of $N$ categories) is $a_j = \dfrac{e^{z_j}}{\sum_{k = 1}^{N}{e^{z_k}}}$.
tip) If we apply Softmax regression with $N = 2$ categories, it essentially reduces to logistic regression.
Since each Softmax neuron computes the probability of one category, we need as many neurons as there are categories in the last (Softmax) layer, e.g. for handwritten digit classification we need 10 Softmax neurons.
note) In other activation functions each output $a_j$ depends only on its own $z_j$, but in Softmax each output depends on the $z$ values of all the neurons in the layer.
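A minimal sketch of this behavior (the logit values here are made up for illustration):

```python
import tensorflow as tf

# example logits for N = 4 categories (made-up values)
z = tf.constant([2.0, 1.0, 0.1, -1.0])

# softmax turns them into probabilities that sum to 1;
# each output value depends on all of the z values
a = tf.nn.softmax(z)
print(a.numpy())        # approximately [0.64, 0.23, 0.10, 0.03]
print(a.numpy().sum())  # 1.0
```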
import tensorflow as tf
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model = Sequential([
    Input(shape=(400,)),
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='softmax')
])
# sparse means that y can only take one of the categories
model.compile(loss=SparseCategoricalCrossentropy())
model.fit(X, Y, epochs=100)
Even though this code works, there is a better implementation for this purpose:
If, instead of computing the Softmax values as the model output, we use a Linear activation in the last layer and let the loss function compute the Softmax internally, the numerical round-off error will be smaller and the model will be more accurate.
In this practice:
- Set `linear` as the activation of the last layer
- Set the number of last-layer units equal to the number of output classes
- Pass the `from_logits=True` argument to the loss function
- To get the probabilities, pass the predictions to the right activation in the `tf.nn` namespace
The raw predictions (logits) will be assorted positive and negative numbers instead of probabilities. To get just the final class, we don't need to calculate the probabilities at all; we can simply return the index of the largest output value (as shown after the code below).
import tensorflow as tf
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy
model = Sequential([
    Input(shape=(400,)),
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=10, activation='linear')  # linear instead of softmax
])
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True)) # <- from_logits=True
model.fit(X, Y, epochs=100)
logits = model.predict(X) # positive and negative numbers
f_x = tf.nn.softmax(logits) # calculate the probabilities
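When only the predicted class is needed, the softmax step can be skipped entirely, since the largest logit corresponds to the largest probability (a small sketch, reusing the `logits` array from above):

```python
import numpy as np

# the class with the largest logit is also the class with the largest
# probability, so argmax on the logits gives the final prediction directly
predicted_classes = np.argmax(logits, axis=1)
```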
In an application that needs to predict whether there are cars, buses, and people in an image, the output should be an array of three binary values.
So the output layer has three Sigmoid neurons.
note) This is called Multi-label classification
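A minimal sketch of such a model (the hidden-layer sizes and input shape here are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy

model = Sequential([
    Input(shape=(400,)),                   # assumed input size, for illustration
    Dense(units=25, activation='relu'),
    Dense(units=15, activation='relu'),
    Dense(units=3, activation='sigmoid')   # one Sigmoid output per label: car, bus, person
])

# each of the three outputs is an independent binary prediction,
# so binary cross-entropy is applied to each of them
model.compile(loss=BinaryCrossentropy())
```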
Adam stands for Adaptive Moment Estimation and can adjust the learning rate automatically during training.
In the Adam algorithm, every parameter has its own learning rate.
Adam is the most popular optimizer and is a safe choice for most applications.
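For example, the compile call for the Softmax model above could specify Adam explicitly (the initial learning rate of 1e-3 is just a common starting point, not a value from these notes):

```python
import tensorflow as tf
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # Adam with an initial global learning rate
    loss=SparseCategoricalCrossentropy(from_logits=True)
)
```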
In Convolutional layers, each neuron looks at just a limited region (a window) of the input rather than the whole input (see the sketch after the list below).
Some of the benefits of Convolutional Neural Networks (CNNs):
- Faster computation
- Need less training data
- Less prone to overfitting
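A rough sketch of a small CNN in Keras (the filter counts, layer sizes, and input shape are illustrative assumptions, not values from these notes):

```python
import tensorflow as tf
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

cnn = Sequential([
    Input(shape=(28, 28, 1)),                              # assumed 28x28 grayscale images
    Conv2D(filters=8, kernel_size=3, activation='relu'),   # each unit only sees a 3x3 window
    MaxPooling2D(pool_size=2),
    Conv2D(filters=16, kernel_size=3, activation='relu'),
    MaxPooling2D(pool_size=2),
    Flatten(),
    Dense(units=10, activation='linear')                   # logits for 10 classes
])
cnn.compile(loss=SparseCategoricalCrossentropy(from_logits=True))
```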